INTERSPEECH.2011 - Speech Recognition

Total: 154

#1 Region dependent transform on MLP features for speech recognition

Authors: Tim Ng ; Bing Zhang ; Spyros Matsoukas ; Long Nguyen

In this work, Region Dependent Transform (RDT) is used as a feature extraction process to combine traditional short-term acoustic features with features derived from Multi-Layer Perceptrons (MLPs) trained on long-term features. Compared to the conventional feature augmentation approach, substantial improvement is obtained. Moreover, an improved RDT training procedure, in which speaker-dependent transforms are taken into account, is proposed for feature combination in Speaker Adaptive Training. By incorporating the higher-dimensional features output from the layer prior to the bottleneck layer into our Speech-to-Text (STT) system using RDT, significant improvement is achieved compared to using conventional bottleneck features. In summary, by using MLP-derived features with RDT, an 8.2% to 11.4% relative reduction in Character Error Rate is achieved for our Mandarin STT systems.
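
The core RDT operation can be sketched as follows: a GMM softly assigns each frame to a region of acoustic space, and the output feature is the posterior-weighted sum of region-specific affine transforms. Below is a minimal numpy illustration of that structure only; all sizes and the random parameters are placeholders, not the trained system described in the paper.

```python
import numpy as np

# Toy setup: R regions, input dim D_in (e.g. short-term + MLP features
# concatenated), output dim D_out. All parameters are random placeholders.
rng = np.random.default_rng(0)
R, D_in, D_out = 4, 20, 10
means = rng.normal(size=(R, D_in))               # GMM region centers
transforms = rng.normal(size=(R, D_out, D_in))   # one linear transform per region
biases = rng.normal(size=(R, D_out))

def region_posteriors(x):
    """Soft region assignment from an (isotropic, equal-prior) toy GMM."""
    log_lik = -0.5 * np.sum((x - means) ** 2, axis=1)
    log_lik -= log_lik.max()
    p = np.exp(log_lik)
    return p / p.sum()

def rdt(x):
    """Posterior-weighted sum of region-specific affine transforms."""
    gamma = region_posteriors(x)                       # (R,)
    y = np.einsum('r,rod,d->o', gamma, transforms, x)  # weighted transforms
    return y + gamma @ biases

x = rng.normal(size=D_in)   # one frame of combined features
print(rdt(x).shape)         # (10,)
```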

#2 Discriminant sub-space projection of spectro-temporal speech features based on maximizing mutual information

Authors: Martin Heckmann ; Claudius Gläser

We previously developed noise-robust Hierarchical Spectro-Temporal (Hist) speech features. The learning of the features was performed in an unsupervised way with unlabeled speech data. In a final stage we deployed Principal Component Analysis (PCA) to reduce the feature dimensions and to diagonalize them. In this paper we investigate whether a discriminant projection can further increase performance. We maximize the mutual information between the features and the phoneme categories using a procedure known as Maximizing Renyi's Mutual Information (MRMI) and also compare it to Linear Discriminant Analysis (LDA). Based on recognition tests in clean and noisy conditions, i.e. in matched and mismatched conditions, we show that the discriminant projections increase recognition scores compared to PCA in matched conditions. However, this improvement does not transfer to the mismatched, i.e. noisy, conditions. We discuss measures to alleviate this problem. Overall, MRMI performs better than LDA.

#3 Combining feature space discriminative training with long-term spectro-temporal features for noise-robust speech recognition

Authors: Takashi Fukuda ; Osamu Ichikawa ; Masafumi Nishimura

Discriminative training of the feature space using the maximum mutual information (fMMI) objective function has been shown to yield remarkable accuracy improvements. For noisy environments, fMMI can be regarded as an effective noise compensation algorithm and can play a significant role in noise robustness. Feature space speaker adaptation techniques such as feature space maximum likelihood linear regression (fMLLR) are also well known and suitable for mismatched test data. These feature space transform algorithms are essential for modern speech recognition but still need further improvement under low SNR conditions. Meanwhile, long-term spectro-temporal information has also received attention as a complement to traditional short-term features. We previously proposed long-term temporal features to improve ASR accuracy for low SNR speech. In this paper, we show that long-term temporal features can be combined with fMMI to build more discriminative models for noisy speech, and the proposed method performed favorably under low SNR conditions.

#4 Combining frame and segment level processing via temporal pooling for phonetic classification

Authors: Sumit Chopra ; Patrick Haffner ; Dimitrios Dimitriadis

We propose a simple, yet novel, multi-layer model for the problem of phonetic classification. Our model combines a frame level transformation of the acoustic signal with a segment level phone classification. Our key contribution is the study of new temporal pooling strategies that interface these two levels, determining how frame scores are converted into segment scores. On the TIMIT benchmark, we match the best performance obtained using a single classifier. Diversity in pooling strategies is further used to generate candidate classifiers with complementary performance characteristics, which perform even better as an ensemble. Without the use of any phonetic knowledge, our ensemble model achieves a 16.96% phone classification error. While our data-driven approach is exhaustive, the combinatorial inflation is limited to the smaller segmental half of the system.
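
The pooling interface the paper studies can be sketched in a few lines: per-frame class scores over a hypothesized segment are collapsed into a single segment-level score vector. Here is a hedged numpy sketch with three common strategies; the exact pooling functions evaluated in the paper may differ.

```python
import numpy as np

def pool_segment(frame_scores, strategy="mean", beta=1.0):
    """Pool a (T, C) matrix of per-frame class scores into one (C,) segment score.

    'mean' averages frames, 'max' keeps the most confident frame per class,
    and 'lse' (log-sum-exp) interpolates between the two via beta.
    """
    if strategy == "mean":
        return frame_scores.mean(axis=0)
    if strategy == "max":
        return frame_scores.max(axis=0)
    if strategy == "lse":
        # beta -> 0 approaches mean pooling; beta -> inf approaches max pooling
        z = beta * frame_scores
        m = z.max(axis=0)
        return (m + np.log(np.exp(z - m).mean(axis=0))) / beta
    raise ValueError(strategy)

scores = np.random.randn(12, 39)       # 12 frames, 39 phone classes (toy)
segment_score = pool_segment(scores, "lse", beta=2.0)
phone = segment_score.argmax()
```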

#5 Improved bottleneck features using pretrained deep neural networks

Authors: Dong Yu ; Michael L. Seltzer

Bottleneck features have been shown to be effective in improving the accuracy of automatic speech recognition (ASR) systems. Conventionally, bottleneck features are extracted from a multilayer perceptron (MLP) trained to predict context-independent monophone states. The MLP typically has three hidden layers and is trained using the backpropagation algorithm. In this paper, we propose two improvements to the training of bottleneck features motivated by recent advances in the use of deep neural networks (DNNs) for speech recognition. First, we show how the use of unsupervised pretraining of a DNN enhances the network's discriminative power and improves the bottleneck features it generates. Second, we show that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states. Bottleneck features trained using the proposed methods produced a 16% relative reduction in sentence error rate over conventional bottleneck features on a large vocabulary business search task.
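
Structurally, bottleneck feature extraction amounts to a forward pass through a network with one narrow hidden layer, keeping that layer's activations and discarding everything above it. A minimal numpy sketch of that structure follows; the random weights stand in for the pretrained, fine-tuned DNN of the paper, and all layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Layer sizes: input (11 stacked frames x 39 dims) -> hidden -> bottleneck(39)
# -> hidden -> senone softmax. Sizes and weights are placeholders.
sizes = [429, 1024, 1024, 39, 1024, 2000]
weights = [rng.normal(scale=0.05, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
BOTTLENECK_LAYER = 3  # the 39-dim narrow layer (1-based layer index)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bottleneck_features(x):
    """Forward pass up to the narrow layer; everything above is discarded."""
    h = x
    for layer, (W, b) in enumerate(zip(weights, biases), start=1):
        h = sigmoid(h @ W + b)
        if layer == BOTTLENECK_LAYER:
            return h  # (39,) bottleneck activations used as ASR features

x = rng.normal(size=429)      # one stacked-frame input
feat = bottleneck_features(x)
print(feat.shape)             # (39,)
```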

#6 Minimum classification error based spectro-temporal feature extraction for robust audio classification

Authors: Yuan-Fu Liao ; Chia-Hsing Lin ; We-Der Fang

Mel-frequency cepstral coefficients (MFCCs) are the most popular features for automatic audio classification (AAC). However, MFCCs are often not robust in adverse environments. In this paper, a minimum classification error (MCE)-based method is proposed to extract new and robust spectro-temporal features as alternatives to MFCCs. The robustness of the proposed features is evaluated on noisy non-speech sounds from the RWCP Sound Scene Database in Real Acoustic Environments, with settings similar to the Aurora 2 multi-condition training task. Experimental results show the proposed features achieve the lowest average recognition error rate of 3.17%, which is much better than state-of-the-art MFCCs plus mean subtraction, variance normalization, and ARMA filtering (MFCC+MVA, 4.31%), Gabor filters with principal component analysis (Gabor+PCA, 4.43%), and linear discriminant analysis (LDA, 4.20%) features. We thus confirm the robustness of the proposed spectro-temporal feature extraction approach.
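
For reference, the MFCC+MVA baseline cited above follows a well-known recipe (Chen & Bilmes): per-utterance mean subtraction and variance normalization, followed by an ARMA smoothing filter along each feature trajectory. A sketch assuming the standard order-M ARMA form; edge frames are left unfiltered for brevity.

```python
import numpy as np

def mva(features, M=2):
    """MVA post-processing of a (T, D) feature matrix.

    Mean subtraction + variance normalization per utterance, then the
    ARMA smoothing filter of order M:
        y[t] = (y[t-M] + ... + y[t-1] + x[t] + ... + x[t+M]) / (2M + 1)
    """
    x = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    T = len(x)
    y = x.copy()  # first M outputs are initialized with the input
    for t in range(M, T - M):
        y[t] = (y[t - M:t].sum(axis=0) + x[t:t + M + 1].sum(axis=0)) / (2 * M + 1)
    return y

feats = np.random.randn(100, 13)   # e.g. 13 MFCCs over 100 frames (toy)
smoothed = mva(feats)
```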

#7 Integrating recent MLP feature extraction techniques into TRAP architecture

Authors: František Grézl ; Martin Karafiát

This paper focuses on the incorporation of recent techniques for multi-layer perceptron (MLP) based feature extraction into the Temporal Pattern (TRAP) and Hidden Activation TRAP (HATS) feature extraction schemes. The TRAP scheme has been the origin of various MLP-based features, some of which are now an integral part of state-of-the-art LVCSR systems. The modifications which brought the most improvement - sub-phoneme targets and the Bottle-Neck technique - are introduced into the original TRAP scheme. The introduction of sub-phoneme targets uncovered the hidden danger of having too many classes in the TRAP/HATS scheme. On the other hand, the Bottle-Neck technique improved the TRAP/HATS scheme so that it is competitive with other approaches.

#8 Feature frame stacking in RNN-based tandem ASR systems - learned vs. predefined context

Authors: Martin Wöllmer ; Björn Schuller ; Gerhard Rigoll

As phoneme recognition is known to profit from techniques that consider contextual information, neural networks applied in Tandem automatic speech recognition (ASR) systems usually employ some form of context modeling. While approaches based on multi-layer perceptrons or recurrent neural networks (RNNs) are able to model a predefined amount of context by simultaneously processing a stacked sequence of successive feature vectors, bidirectional Long Short-Term Memory (BLSTM) networks were shown to be well suited for incorporating a self-learned amount of context for phoneme prediction. In this paper, we evaluate combinations of BLSTM modeling and frame stacking to determine the most efficient method for exploiting context in RNN-based Tandem systems. Using the COSINE corpus and our recently introduced multi-stream BLSTM-HMM decoder, we provide empirical evidence for the intuition that BLSTM networks make frame stacking redundant, whereas RNNs profit from predefined feature-level context.
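
Frame stacking, the predefined-context mechanism compared here, is simple to state in code: each feature vector is concatenated with a fixed window of its neighbors before being fed to the network. A numpy sketch with an illustrative 4+1+4 window (window sizes are assumptions, not the paper's configuration):

```python
import numpy as np

def stack_frames(feats, left=4, right=4):
    """Stack each frame with `left` past and `right` future frames.

    feats: (T, D) array; returns (T, (left+1+right)*D), with edge frames
    padded by repeating the first/last frame.
    """
    T, D = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.concatenate([padded[i:i + T] for i in range(left + 1 + right)],
                          axis=1)

x = np.random.randn(200, 39)
stacked = stack_frames(x)       # (200, 351): predefined 9-frame context
```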

#9 Improved acoustic feature combination for LVCSR by neural networks

Authors: Christian Plahl ; Ralf Schlüter ; Hermann Ney

This paper investigates the combination of different acoustic features. Several methods to combine these features, such as concatenation or LDA, are well known. Even though LDA improves the system, feature combination by LDA has been shown to be suboptimal. We introduce a new combination method based on neural networks. The posterior estimates derived from the NN lead to a significant improvement and achieve a 6% relative better word error rate (WER). Results are also compared to system combination. While system combination has been reported to outperform all other combination techniques, in this work the proposed NN-based combination outperforms system combination. We achieve a 2% relative better WER, resulting in an improvement of 7% relative over the baseline system.

#10 Hierarchical tandem features for ASR in Mandarin

Authors: Joel Pinto ; Mathew Magimai-Doss ; Hervé Bourlard

We apply multilayer perceptron (MLP) based hierarchical Tandem features to large vocabulary continuous speech recognition in Mandarin. Hierarchical Tandem features are estimated using a cascade of two MLP classifiers which are trained independently. The first classifier is trained on perceptual linear predictive coefficients with a 90 ms temporal context. The second classifier is trained using the phonetic class conditional probabilities estimated by the first MLP, but with a relatively longer temporal context of about 150 ms. Experiments on the Mandarin DARPA GALE eval06 dataset show significant reduction (7.6% relative) in character error rates by using hierarchical Tandem features over conventional Tandem features.

#11 Analysis and comparison of recent MLP features for LVCSR systems

Authors: Fabio Valente ; Mathew Magimai-Doss ; Wen Wang

MLP-based front-ends have evolved in different ways in recent years beyond the seminal TANDEM-PLP features. This paper aims at providing a fair comparison of these recent advances, including the use of different long/short temporal inputs (PLP, MRASTA, wLP-TRAPS, DCT-TRAPS) and the use of complex architectures (bottleneck, hierarchy, multistream) that go beyond the conventional three-layer MLP. Furthermore, the paper identifies which of these actually provide advantages over the conventional TANDEM-PLP. The investigation is carried out on an LVCSR task for recognition of Mandarin broadcast speech, and results are analyzed in terms of Character Error Rate and phonetic confusions. Results reveal that, as stand-alone features, multistream front-ends can outperform conventional MFCCs by 10%, while TANDEM-PLP improves by only 1%. On the other hand, when used in concatenation with MFCC features, hierarchical/bottleneck front-ends reduce the character error rate by 18% relative, compared to 14% relative for TANDEM-PLP. The various long-term input representations recently developed provide comparable performances.

#12 Deep learning of speech features for improved phonetic recognition

Authors: Jaehyung Lee ; Soo-Young Lee

Recently, a remarkable performance of 23.0% Phone Error Rate (PER) on the TIMIT core test set was reported by applying Deep Belief Networks (DBNs) to phonetic recognition [1]. Despite the good performance reported, there is still substantial room for improvement in the reported design in order to achieve optimal results. In this paper, we present an improved but simple architecture for phonetic recognition which uses the log-Mel spectrum directly instead of Mel-Frequency Cepstral Coefficients (MFCCs), and combines deep learning with conventional Baum-Welch re-estimation for sub-phoneme alignment. Experiments performed on the TIMIT speech corpus show that the proposed method outperforms most conventional methods, yielding 21.4% PER on the complete test set of TIMIT and 22.1% on the core test set.
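
The log-Mel input representation used here is essentially the MFCC pipeline stopped before the final DCT: framed power spectra passed through a triangular mel filterbank and compressed by a log. A self-contained numpy sketch with illustrative parameters follows; the paper's exact analysis settings may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Log-Mel spectrogram: STFT power -> triangular mel filterbank -> log."""
    # Triangular filterbank on a mel-spaced grid
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        l, c, r = bins[m], bins[m + 1], bins[m + 2]
        fb[m, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Framed power spectrum (Hamming window)
    frames = [signal[s:s + n_fft] * np.hamming(n_fft)
              for s in range(0, len(signal) - n_fft, hop)]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return np.log(power @ fb.T + 1e-10)   # (T, n_mels); no DCT, unlike MFCC

feats = log_mel(np.random.randn(16000))   # 1 s of toy audio
```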

#13 Globality-locality consistent discriminant analysis for phone classification

Authors: Heyun Huang ; Yang Liu ; Jort F. Gemmeke ; Louis ten Bosch ; Bert Cranen ; Lou Boves

Concatenating sequences of feature vectors helps to capture essential information about articulatory dynamics, at the cost of increasing the number of dimensions in the feature space, which may be characterized by the presence of manifolds. Existing supervised dimensionality reduction methods such as Linear Discriminant Analysis may destroy part of that manifold structure. In this paper, we propose a novel supervised dimensionality reduction algorithm, called Globality-Locality Consistent Discriminant Analysis (GLCDA), which aims to preserve global and local discriminant information simultaneously. Because it allows finding the optimal trade-off between the global and local structure of data sets, GLCDA can provide a more faithful compact representation of high-dimensional observations than entirely global approaches or heuristic approaches aimed at preserving local information. Experimental results on the TIMIT phone classification task show the effectiveness of the proposed algorithm.

#14 Front-end compensation methods for LVCSR under Lombard effect

Authors: Hynek Bořil ; František Grézl ; John H. L. Hansen

This study analyzes the impact of noisy background variations and Lombard effect (LE) on large vocabulary continuous speech recognition (LVCSR). Robustness of several front-end feature extraction strategies combined with state-of-the-art feature distribution normalizations is tested on neutral and Lombard speech from the UT-Scope database presented in two types of background noise at various levels of SNR. An extension of a bottleneck (BN) front-end utilizing normalization of both critical band energies (CRBE) and BN outputs is proposed and shown to provide a competitive performance compared to the best MFCC-based system. A novel MFCC-based BN front-end is introduced and shown to outperform all other systems in all conditions considered (average 4.1% absolute WER reduction over the second best system). Additionally, two phenomena are observed: (i) combination of cepstral mean subtraction and recently established RASTALP filtering significantly reduces transient effects of RASTA band-pass filtering and increases ASR robustness to noise and LE; (ii) histogram equalization may benefit from utilizing reference distributions derived from pre-normalized rather than raw training features, and also from adopting distributions from different front-ends.
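
Histogram equalization, the last normalization discussed above, maps each feature dimension through its empirical CDF onto a reference distribution. A sketch using a standard normal reference; the paper's point is that reference distributions derived from pre-normalized training features can work better than this default.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(feats, ref_ppf=norm.ppf):
    """Map each feature dimension through its empirical CDF onto a
    reference distribution (standard normal by default)."""
    T, D = feats.shape
    out = np.empty_like(feats)
    for d in range(D):
        ranks = feats[:, d].argsort().argsort()   # rank of each frame: 0..T-1
        cdf = (ranks + 0.5) / T                   # keep CDF away from 0 and 1
        out[:, d] = ref_ppf(cdf)
    return out

noisy = np.random.randn(300, 13) * 3.0 + 1.0
equalized = histogram_equalize(noisy)   # each dim now approx. N(0, 1)
```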

#15 Classification of fricatives using feature extrapolation of acoustic-phonetic features in telephone speech

Authors: Jung-Won Lee ; Jeung-Yoon Choi ; Hong-Goo Kang

This paper proposes a classification module for fricative consonants in telephone speech using an acoustic-phonetic feature extrapolation technique. In channel-deteriorated telephone speech, acoustic cues of fricative consonants are expected to be degraded or missing due to the limited bandwidth. This paper applies an extrapolation technique to acoustic-phonetic features based on Gaussian mixture models, which uses statistical learning of the correspondence between acoustic-phonetic features of wideband speech and the spectral characteristics of telephone-bandwidth speech. Experimental results on the NTIMIT database verify that feature extrapolation improves the performance of the fricative classification module for all unvoiced fricatives by around 10% (relative error) compared to the performance obtained using only acoustic-phonetic features extracted from the narrowband signal.
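
The general recipe behind GMM-based feature extrapolation is standard regression on a joint model: fit a GMM over paired [narrowband, wideband] features, then estimate the missing part as the posterior-weighted conditional mean. A hedged sketch of that estimator follows; it illustrates the textbook formula, not the authors' exact system.

```python
import numpy as np

def gmm_extrapolate(x, weights, mu_x, mu_y, S_xx, S_yx):
    """MMSE estimate E[y | x] under a joint GMM over (x, y).

    x:     (Dx,) observed narrowband features
    mu_x:  (K, Dx), mu_y: (K, Dy) component means
    S_xx:  (K, Dx, Dx), S_yx: (K, Dy, Dx) covariance blocks
    """
    K = len(weights)
    log_post = np.empty(K)
    cond_mean = np.empty((K, mu_y.shape[1]))
    for k in range(K):
        d = x - mu_x[k]
        Sinv_d = np.linalg.solve(S_xx[k], d)
        _, logdet = np.linalg.slogdet(S_xx[k])
        log_post[k] = np.log(weights[k]) - 0.5 * (d @ Sinv_d + logdet)
        cond_mean[k] = mu_y[k] + S_yx[k] @ Sinv_d   # per-component regression
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond_mean   # (Dy,) extrapolated wideband features

# Toy usage with placeholder GMM parameters
K, Dx, Dy = 8, 13, 13
rng = np.random.default_rng(0)
wb = gmm_extrapolate(rng.normal(size=Dx), np.full(K, 1.0 / K),
                     rng.normal(size=(K, Dx)), rng.normal(size=(K, Dy)),
                     np.stack([np.eye(Dx)] * K),
                     rng.normal(size=(K, Dy, Dx)) * 0.1)
```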

#16 Noise robust feature extraction based on extended weighted linear prediction in LVCSR

Authors: Sami Keronen ; Jouni Pohjalainen ; Paavo Alku ; Mikko Kurimo

This paper introduces extended weighted linear prediction (XLP) for noise-robust short-time spectrum analysis in the feature extraction process of a speech recognition system. XLP is a generalization of standard linear prediction (LP) and temporally weighted linear prediction (WLP), which have already been applied to noise-robust speech recognition with good results. With XLP, higher controllability of the temporal weighting of different parts of the noisy speech is gained by taking the lags of the signal into account in prediction. Here, the performance of XLP is put up against WLP and the conventional spectrum analysis methods FFT and LP on a large vocabulary continuous speech recognition (LVCSR) task using real-world noisy data containing additive and convolutive noise. The results show improvements over the reference methods in several cases.
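
Plain WLP, the starting point that XLP generalizes, replaces the usual normal equations with sample-weighted ones, typically using short-time energy as the weight so that high-energy (less noise-dominated) samples dominate the fit. A numpy sketch of WLP via the covariance method; XLP would additionally let the weight depend on the lag pair (i, j). Window lengths and the model order are illustrative.

```python
import numpy as np

def wlp(x, order=12, ste_win=12):
    """Temporally weighted linear prediction (covariance method).

    Each residual term is weighted by the short-time energy (STE) of the
    preceding samples; XLP generalizes this so that the weight also
    depends on the lags i, j of the normal equations.
    """
    N = len(x)
    # STE weight: w[n] = sum of x[n-ste_win .. n-1]^2
    w = np.array([np.sum(x[max(0, n - ste_win):n] ** 2)
                  for n in range(N)]) + 1e-8
    R = np.zeros((order + 1, order + 1))
    n = np.arange(order, N)
    for i in range(order + 1):
        for j in range(order + 1):
            R[i, j] = np.sum(w[n] * x[n - i] * x[n - j])
    a = np.linalg.solve(R[1:, 1:], R[1:, 0])  # predictor coefficients
    return a  # all-pole model: x[n] ~ sum_i a[i] * x[n-i]

coeffs = wlp(np.random.randn(400))
```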

#17 Comparing different flavors of spectro-temporal features for ASR

Authors: Bernd T. Meyer ; Suman V. Ravuri ; Marc René Schädler ; Nelson Morgan

In the last decade, several studies have shown that the robustness of ASR systems can be increased when 2D Gabor filters are used to extract specific modulation frequencies from the input pattern. This paper analyzes important design parameters for spectro-temporal features based on a Gabor filter bank: We perform experiments with filters that exhibit different phase sensitivity. Further, we analyze whether non-linear weighting with a multi-layer perceptron (MLP) and a subsequent concatenation with mel-frequency cepstral coefficients (MFCCs) has beneficial effects. For the Aurora2 noisy digit recognition task, the use of phase-sensitive filters improved on the MFCC baseline, whereas using filters that neglect phase information did not. While MLP processing alone did not have a large effect on the overall performance, the best results were obtained for MLP-processed phase-sensitive filters with added MFCCs, with relative error reductions of over 40% for both noisy and clean training.
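
A single spectro-temporal Gabor filter is a 2D complex sinusoid under a smooth envelope; keeping the real part of the response preserves phase sensitivity, while taking the magnitude of the complex response discards it, which is the contrast studied above. An illustrative sketch, with filter parameters and envelope chosen as placeholders rather than the paper's filter bank:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(omega_t=0.25, omega_f=0.5, size_t=15, size_f=9):
    """Complex 2D Gabor filter tuned to one temporal/spectral modulation
    frequency pair (radians per frame / per mel channel)."""
    t = np.arange(size_t) - size_t // 2
    f = np.arange(size_f) - size_f // 2
    envelope = np.outer(np.hanning(size_f), np.hanning(size_t))
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    return envelope * carrier

spec = np.random.randn(40, 200)   # (mel channels, frames) toy log-mel input
g = gabor_2d()
phase_sensitive = convolve2d(spec, g.real, mode='same')
phase_insensitive = np.abs(convolve2d(spec, g, mode='same'))
```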

#18 VTLN in the MFCC domain: band-limited versus local interpolation

Authors: Ehsan Variani ; Thomas Schaaf

We propose a new easy-to-implement method to compute a Linear Transform (LT) to perform Vocal Tract Length Normalization (VTLN) on truncated Mel Frequency Cepstral Coefficients (MFCCs) normally used in distributed speech recognition. The method is based on a Local Interpolation which is independent of the Mel filter design. Local Interpolation (LILT) VTLN is theoretically and experimentally compared to a global scheme based on band-limited interpolation (BLI-VTLN) and the conventional frequency warping scheme (FFT-VTLN). Investigating the interoperability of these methods shows that the performance of LILT-VTLN is on par with FFT-VTLN and BLI-VTLN. The statistical significance test also shows that there are no significant differences between FFT-VTLN, LILT-VTLN, and BLI-VTLN, even if the models and front ends do not match.
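
The structural idea of a linear cepstral-domain VTLN transform can be sketched as follows: build a local (linear) interpolation matrix that warps the log-mel axis, then conjugate it by the truncated DCT so it acts directly on MFCCs. The sketch below is only a rough illustration of that construction under simplifying assumptions (a global piecewise-linear warp, a pseudo-inverse for the truncated DCT); the paper's LILT derivation differs in detail.

```python
import numpy as np
from scipy.fftpack import dct

def vtln_transform(alpha=1.1, n_mels=23, n_ceps=13):
    """Linear VTLN transform on truncated MFCCs (structural sketch).

    Build a local-interpolation warping matrix W in the log-mel domain
    (output channel c reads off position c/alpha by linear interpolation),
    then map it to the cepstral domain: L = C_trunc @ W @ pinv(C_trunc).
    """
    W = np.zeros((n_mels, n_mels))
    for c in range(n_mels):
        pos = min(c / alpha, n_mels - 1)   # toy piecewise-linear warp
        lo = int(np.floor(pos))
        frac = pos - lo
        W[c, lo] = 1.0 - frac
        if lo + 1 < n_mels:
            W[c, lo + 1] = frac
    C = dct(np.eye(n_mels), type=2, norm='ortho', axis=0)[:n_ceps]  # truncated DCT
    return C @ W @ np.linalg.pinv(C)       # (n_ceps, n_ceps)

L = vtln_transform(alpha=1.1)
mfcc = np.random.randn(13)
warped = L @ mfcc
```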

#19 Multistream bandpass modulation features for robust speech recognition

Authors: Sridhar Krishna Nemala ; Kailash Patil ; Mounya Elhilali

Current understanding of speech processing in the brain suggests dual streams of processing of temporal and spectral information, whereby slow vs. fast modulations are analyzed along parallel paths that encode various scales of information in speech signals. This unique way for the biology to analyze the multiplicity of information in speech signals along parallel paths can bear great lessons for feature extraction front-ends in speech processing systems, particularly for dealing with extrinsic degradations and unseen noise distortions. Here, we propose a multistream approach to feature analysis for robust speaker-independent phoneme recognition in the presence of nonstationary background noises. The scheme presented here centers around a multi-path bandpass modulation analysis of speech sounds, with each stream covering an entire range of temporal and spectral modulations. By performing bandpass operations of slow vs. fast information along the spectral and temporal dimensions, the proposed scheme avoids the classic feature explosion problem of previous multistream approaches while maintaining the advantage of parallelism and localized feature analysis. The proposed architecture results in substantial improvements over standard baseline features and two state-of-the-art noise-robust feature schemes.

#20 An analysis of automatic speech recognition with multiple microphones

Authors: Davide Marino ; Thomas Hain

Automatic speech recognition in real-world situations often requires the use of microphones distant from the speaker's mouth. One or several microphones are placed in the surroundings to capture many versions of the original signal. Recognition with a single far-field microphone yields considerably poorer performance than with person-mounted devices (headset, lapel), the main causes being reverberation and noise. Acoustic beam-forming techniques allow significant improvements over the use of a single microphone, although error rates still remain well above close-talking results. In this paper we investigate the use of beam-forming in the context of speaker movement, together with commonly used adaptation techniques, and compare against a naive multi-stream approach. We show that even such a simple approach can yield results equivalent to beam-forming, allowing for far more powerful integration of multiple microphone sources in ASR systems.
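
The baseline beam-forming operation referred to here is, in its simplest form, delay-and-sum: estimate each channel's delay against a reference microphone and average the aligned signals. A numpy sketch using plain cross-correlation for delay estimation; real systems typically use GCC-PHAT and adaptive steering, and the circular shift here is a simplification.

```python
import numpy as np

def delay_and_sum(channels, ref=0, max_delay=160):
    """Delay-and-sum beamforming over (M, N) multi-microphone audio.

    Per-channel delays against a reference mic are estimated by
    cross-correlation, then channels are aligned (circular shift,
    for brevity) and averaged.
    """
    M, N = channels.shape
    out = np.zeros(N)
    for m in range(M):
        corr = np.correlate(channels[m], channels[ref], mode='full')
        center = N - 1                       # index of zero lag
        window = corr[center - max_delay:center + max_delay + 1]
        delay = window.argmax() - max_delay  # samples channel m lags ref
        out += np.roll(channels[m], -delay)
    return out / M

mics = np.random.randn(4, 1600)   # 4 microphones, 0.1 s at 16 kHz (toy)
enhanced = delay_and_sum(mics)
```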

#21 Conversational speech transcription using context-dependent deep neural networks

Authors: Frank Seide ; Gang Li ; Dong Yu

We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement.

#22 Sequential classification criteria for NNs in automatic speech recognition

Authors: Guangsen Wang ; Khe Chai Sim

Neural networks (NNs) are discriminative classifiers which have been successfully integrated with hidden Markov models (HMMs), either in hybrid NN/HMM or tandem connectionist systems. Typically, the NNs are trained with the frame-based cross-entropy criterion to classify phonemes or phoneme states. However, for word recognition, the word error rate is more closely related to sequence classification criteria, such as maximum mutual information and minimum phone error. In this paper, lattice-based sequence classification criteria are used to train the NNs in the hybrid NN/HMM system and the tandem system. A product-of-experts-based factorization and smoothing scheme is proposed for the hybrid system to scale the lattice-based NN training up to 6000 triphone states. Experimental results on WSJCAM0 reveal that the NNs trained with the sequential classification criterion yield a 24.2% relative improvement over cross-entropy trained NNs for the hybrid system.

#23 Grapheme-based automatic speech recognition using KL-HMM

Authors: Mathew Magimai-Doss ; Ramya Rasipuram ; Guillermo Aradilla ; Hervé Bourlard

State-of-the-art automatic speech recognition (ASR) systems typically use phonemes as subword units. In this work, we present a novel grapheme-based ASR system that jointly models phoneme and grapheme information using a Kullback-Leibler divergence-based HMM system (KL-HMM). More specifically, the underlying subword unit models are grapheme units, and the phonetic information is captured through phoneme posterior probabilities (referred to as posterior features) estimated using a multilayer perceptron (MLP). We investigate the proposed approach for ASR on English, where the correspondence between phonemes and graphemes is weak. In particular, we investigate the effect of contextual modeling on the grapheme-based KL-HMM system and the use of an MLP trained on auxiliary data. Experiments on the DARPA Resource Management corpus have shown that the grapheme-based ASR system modeling longer subword unit context can achieve the same performance as a phoneme-based ASR system, irrespective of the data on which the MLP is trained.
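
In a KL-HMM, each state holds a trainable categorical distribution over the MLP's phoneme classes, and the usual GMM log-likelihood is replaced by a KL divergence between that distribution and the frame's posterior feature. A minimal sketch of the local score, using one of the KL directions studied in the KL-HMM literature; the toy numbers are illustrative.

```python
import numpy as np

def kl_local_score(state_dist, posterior, eps=1e-10):
    """KL-HMM local cost: D_KL(state distribution || posterior feature).

    Both arguments are categorical distributions over phoneme classes;
    this replaces the GMM log-likelihood as the state emission score.
    """
    y = np.clip(state_dist, eps, 1.0)
    z = np.clip(posterior, eps, 1.0)
    return np.sum(y * np.log(y / z))

# Toy example: a state tuned to phoneme class 2, scored against one frame
state = np.array([0.05, 0.05, 0.8, 0.1])
frame_posterior = np.array([0.1, 0.1, 0.7, 0.1])   # from the MLP
cost = kl_local_score(state, frame_posterior)
```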

#24 Direct error rate minimization of hidden Markov models

Authors: Joseph Keshet ; Chih-Chieh Cheng ; Mark Stoehr ; David McAllester

We explore discriminative training of HMM parameters that directly minimizes the expected error rate. In discriminative training one is interested in training a system to minimize a desired error function, such as word error rate, phone error rate, or frame error rate. We review a recent method (McAllester, Hazan and Keshet, 2010), which introduces an analytic expression for the gradient of the expected error rate. The analytic expression leads to a perceptron-like update rule, which is adapted here for training HMMs in an online fashion. While the proposed method can work with any type of error function used in speech recognition, we evaluated it on phoneme recognition on TIMIT, with frame error rate as the error function used for training. Except for the case of a GMM with a single mixture per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large margin.
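
The flavor of such a perceptron-like rule can be conveyed with a toy instantiation: decode once as usual and once with the task loss added to the score (loss-adjusted decoding); the difference of the two feature vectors approximates the gradient of the expected error. This simplified sketch uses a single-label "sequence" so that decoding is an argmax; it is a loose rendition of the idea, not the authors' exact formulation.

```python
import numpy as np

# Toy instantiation: multiclass "sequence" of length 1, so the decode is
# an argmax over classes. phi(x, y) places x into class y's weight slot.
C, D = 4, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(C, D))

def phi(x, y):
    f = np.zeros((C, D))
    f[y] = x
    return f

def decode(W, x, margin, y_true):
    # margin > 0 adds the 0/1 task loss to the score (loss-adjusted decode)
    scores = W @ x + margin * (np.arange(C) != y_true)
    return int(scores.argmax())

def direct_loss_update(W, x, y_true, lr=0.1, eps=1.0):
    """Perceptron-like step approximating the expected-error gradient
    (after McAllester, Hazan & Keshet, 2010; heavily simplified here)."""
    y_plain = decode(W, x, 0.0, y_true)
    y_adj = decode(W, x, eps, y_true)
    return W - lr * (phi(x, y_adj) - phi(x, y_plain)) / eps

x, y = rng.normal(size=D), 2
for _ in range(20):
    W = direct_loss_update(W, x, y)
```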

#25 On the effectiveness of statistical modeling based template matching approach for continuous speech recognition

Authors: Xie Sun ; Xin Chen ; Yunxin Zhao

In this work, we validate the effectiveness of our recently proposed integrated template matching and statistical modeling approach on four baseline systems with increasing phone recognition accuracies in the range of 73% to 78% on the TIMIT task. The four baselines were generated using 1) Discriminative Training (DT) with Minimum Phone Error (MPE), 2) MFCCs concatenated with ensemble Multi-Layer Perceptron features (MFCC+EMLP), 3) DT combined with the MFCC+EMLP features, and 4) data-sampling-based ensemble acoustic models integrated with DT and MFCC+EMLP features. Experimental results obtained from template-matching-based rescoring of the phone lattices generated by the baseline models show that our template matching approach produces consistent and significant improvements over all four baselines, with the highest recognition accuracy of 79.55% obtained from rescoring the phone lattices produced by the ensemble acoustic model baseline.